
algod importer: Update sync on WaitForBlock error. #122

Merged
winder merged 11 commits into algorand:master from will/update-sync-on-error on Jul 21, 2023

Conversation

@winder (Contributor) commented Jul 20, 2023

Summary

If algod is restarted after it receives a sync round update but before it fetches the new round(s), the algod follower and Conduit will stall: Conduit keeps waiting for algod to reach the new sync round, but that never happens.

This change adds extra error handling to the WaitForBlock call: on a timeout or a bad response, a new attempt to set the sync round is made.

This PR also removes the retry loop from the algod importer. Retries are now managed by the pipeline.
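
For reviewers skimming the diff, here is a minimal sketch of the shape of that logic. The `followerClient` interface and the names used here are illustrative assumptions, not the importer's actual code:

```go
// Minimal sketch only -- the interface and names are assumptions for
// illustration, not the importer's real types.
package importer

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// followerClient abstracts the two follower-node calls this logic relies on.
type followerClient interface {
	// WaitForBlock blocks until the node has the given round, or errors out.
	WaitForBlock(ctx context.Context, round uint64) error
	// SetSyncRound tells the follower node which round it should sync to.
	SetSyncRound(ctx context.Context, round uint64) error
}

var waitForRoundTimeout = 5 * time.Second

// waitForRound waits for the node to reach round. On a timeout or a bad
// response it re-sends the sync round before returning, so a node that was
// restarted after losing its sync round can resume fetching on the
// pipeline's next retry.
func waitForRound(ctx context.Context, c followerClient, round uint64) error {
	waitCtx, cancel := context.WithTimeout(ctx, waitForRoundTimeout)
	defer cancel()

	err := c.WaitForBlock(waitCtx, round)
	if err == nil {
		return nil
	}

	// Re-assert the sync round, then surface the error to the pipeline so
	// its retry logic can call us again.
	if syncErr := c.SetSyncRound(ctx, round); syncErr != nil {
		return errors.Join(err, syncErr)
	}
	return fmt.Errorf("waitForRound %d: %w (sync round re-sent)", round, err)
}
```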

Test Plan

Update existing unit tests.

@winder winder added the Team Lamprey and Bug-Fix (PR proposing to fix a bug) labels Jul 20, 2023
@winder winder requested review from tzaffi, Eric-Warehime and a team July 20, 2023 16:40
@winder winder self-assigned this Jul 20, 2023
codecov bot commented Jul 20, 2023

Codecov Report

Merging #122 (042ec0b) into master (442791a) will increase coverage by 2.71%.
The diff coverage is 77.26%.

@@            Coverage Diff             @@
##           master     #122      +/-   ##
==========================================
+ Coverage   67.66%   70.37%   +2.71%     
==========================================
  Files          32       36       +4     
  Lines        1976     2535     +559     
==========================================
+ Hits         1337     1784     +447     
- Misses        570      654      +84     
- Partials       69       97      +28     
Impacted Files Coverage Δ
conduit/data/block_export_data.go 100.00% <ø> (+92.30%) ⬆️
conduit/metrics/metrics.go 100.00% <ø> (ø)
conduit/pipeline/metadata.go 69.11% <ø> (ø)
...duit/plugins/exporters/filewriter/file_exporter.go 81.63% <ø> (-1.06%) ⬇️
conduit/plugins/exporters/postgresql/util/prune.go 78.43% <ø> (ø)
conduit/plugins/importers/algod/metrics.go 100.00% <ø> (ø)
...ins/processors/filterprocessor/filter_processor.go 83.82% <ø> (+3.54%) ⬆️
...plugins/processors/filterprocessor/gen/generate.go 34.28% <ø> (ø)
conduit/plugins/processors/noop/noop_processor.go 64.70% <ø> (+6.81%) ⬆️
pkg/cli/internal/list/list.go 20.75% <ø> (ø)
... and 15 more

... and 1 file with indirect coverage changes


@winder winder marked this pull request as ready for review July 20, 2023 17:08
@Eric-Warehime (Contributor) left a comment


Looks correct to me. I'm not sure I follow how this causes the pipeline to hang though.

Last I checked if you stop/start the node it will have the last MaxAcctLookback deltas in cache (and even more rounds available). And it will also run ahead MaxAcctLookback-1 rounds.

So unless that number is 1, the node/pipeline should make progress despite the sync round being 1 round lower than what we expect. And the pipeline would correctly update the sync round once it processed another round.

@winder (Contributor, Author) commented Jul 20, 2023

> Looks correct to me. I'm not sure I follow how this causes the pipeline to hang though.
>
> Last I checked if you stop/start the node it will have the last MaxAcctLookback deltas in cache (and even more rounds available). And it will also run ahead MaxAcctLookback-1 rounds.
>
> So unless that number is 1, the node/pipeline should make progress despite the sync round being 1 round lower than what we expect. And the pipeline would correctly update the sync round once it processed another round.

I don't totally understand it either. I'm guessing there is some sort of cooldown / warmup period when rounds are being processed very quickly. For the file processor, each round is processed in the 50–200µs range.

I was able to confirm that the sync round does need to be set again (this is with MaxAcctLookback = 64):

cat metadata.json
{"genesis-hash":"mFgazF+2uRS1tMiL9dsj01hJGySEmPN28B/TjjvpVW0=","network":"betanet","next-round":609262}

curl -XGET "localhost:4190/v2/ledger/sync?pretty" -H "Authorization: Bearer aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
{
  "round": 609198
}
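
Once the follower is stuck like that, re-setting the sync round unblocks it. Here is a rough sketch of doing that programmatically, assuming the follower's POST /v2/ledger/sync/{round} endpoint and reusing the host, token, and next-round values from the example above:

```go
// Rough illustration: bump the follower's sync round to Conduit's next round.
// The host, token, and round values are copied from the example above; adjust
// them for your own node.
package main

import (
	"fmt"
	"net/http"
)

func main() {
	const (
		host  = "http://localhost:4190"
		token = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
		round = 609262 // next-round from metadata.json
	)

	url := fmt.Sprintf("%s/v2/ledger/sync/%d", host, round)
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("set sync round:", resp.Status)
}
```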

@tzaffi (Contributor) commented Jul 20, 2023

> This PR also removes the retry loop from the algod importer. Retries are now managed by the pipeline.

👍

@tzaffi (Contributor) left a comment

This looks correct to me.

I made suggestions about rewording some comments, possibly using errors.Join, and keeping a higher timeout value.

One more thought: during yesterday's standup it sounded like around 1 in 3 shutdowns of algod during catchup would trigger this bug, so bringing algod down and up enough times would practically guarantee it. It may even be possible to simulate this issue reliably in our short-duration E2E tests.

(Four resolved review threads on conduit/plugins/importers/algod/algod_importer.go are hidden as outdated.)
Before:
	const (
		retries = 5
After:
	var (
		waitForRoundTimeout = 5 * time.Second
Contributor commented:

Nit:

A more conservative timeout would be 45 seconds. I agree that we want to give Conduit more determinism about the outcome of each call to the waitForBlock endpoint, so it's a good idea to have the call time out on its own terms rather than the endpoint's, as this PR does. On the other hand, we might still want the ability to keep 10 threads of the algod importer running concurrently after we've caught up, and a 45-second timeout would allow for that. If we narrow the timeout to 5 seconds, we essentially only allow one or two algod importer threads to run at a time (probably only one, due to round-time variability).

That said, we can change the value as aggressively as in the PR, and if the need arises we can raise it back to 45 seconds later.

Contributor Author replied:

The low timeout was intended for responsiveness: when the node is stalled, the timeout needs to elapse before the first recovery attempt. If there's a timeout, I'm expecting the pipeline to retry the call.

The default retry count is 5; now I'm wondering if it should be unlimited.

The old Indexer had a package called fetcher; I wonder if we should bring that back to manage more optimal round caching: https://github.com/algorand/indexer/blob/master/fetcher/fetcher.go#L1

@tzaffi (Contributor) Jul 21, 2023:

A worthwhile thought for a future PR or even the pipelining effort. (suggest keeping this thread unresolved for future reference)

Contributor Author replied:

I'll change the default retry timeout to 0 in a followup PR; it's probably a good default anyway, since people have expressed appreciation for Indexer working that way.


@winder winder requested a review from tzaffi July 21, 2023 14:43
@tzaffi (Contributor) left a comment

Approving, even though I'm still curious whether creating an E2E test is viable. That can be left as a future exercise.

@winder winder merged commit f2711fa into algorand:master Jul 21, 2023
3 checks passed
@winder winder deleted the will/update-sync-on-error branch July 21, 2023 15:41
Labels: Bug-Fix (PR proposing to fix a bug), Team Lamprey
3 participants